Associations Between Radiation Oncologist Demographic Factors and Segmentation Similarity Benchmarks: Insights From a Crowd-Sourced Challenge Using Bayesian Estimation

PURPOSE The quality of radiotherapy auto-segmentation training data, primarily derived from clinician observers, is of utmost importance. However, the factors influencing the quality of clinician-derived segmentations are poorly understood; our study aims to quantify these factors.
METHODS Organ at risk (OAR) and tumor-related segmentations provided by radiation oncologists from the Contouring Collaborative for Consensus in Radiation Oncology data set were used. Segmentations were derived from five disease sites: breast, sarcoma, head and neck (H&N), gynecologic (GYN), and GI. Segmentation quality was determined on a structure-by-structure basis by comparing the observer segmentations with an expert-derived consensus, which served as a reference standard benchmark. The Dice similarity coefficient (DSC) was primarily used as a metric for the comparisons. DSC was stratified into binary groups on the basis of structure-specific expert-derived interobserver variability (IOV) cutoffs. Generalized linear mixed-effects models using Bayesian estimation were used to investigate the association between demographic variables and the binarized DSC for each disease site. Variables with a highest density interval excluding zero were considered to substantially affect the outcome measure.
RESULTS Five hundred seventy-four, 110, 452, 112, and 48 segmentations were used for the breast, sarcoma, H&N, GYN, and GI cases, respectively. The median percentage of segmentations that crossed the expert DSC IOV cutoff when stratified by structure type was 55% and 31% for OARs and tumors, respectively. Regression analysis revealed that the structure being tumor-related had a substantial negative impact on binarized DSC for the breast, sarcoma, H&N, and GI cases. There were no recurring relationships between segmentation quality and demographic variables across the cases, with most variables demonstrating large standard deviations.
CONCLUSION Our study highlights substantial uncertainty surrounding conventionally presumed factors influencing segmentation quality relative to benchmarks.


INTRODUCTION
Segmentation (also termed contouring) of regions of interest (ROIs) on medical images is crucial for radiotherapy planning. [1] Importantly, accurate segmentation of organs at risk (OARs) and tumor-related (ie, target) structures is required to optimize radiotherapeutic efficacy. Segmentation is often performed by clinicians, such as radiation oncologists. However, clinician-derived manual segmentation is a time- and labor-intensive task, thereby prompting the increasing development of artificial intelligence (AI)-based methods for auto-segmentation. [2] The Contouring Collaborative for Consensus in Radiation Oncology (C3RO), a large-scale crowdsourcing challenge for radiotherapy segmentation, demonstrated that nonexpert consensus segmentations could quantitatively approximate expert consensus segmentations in a variety of disease sites, [3] thereby motivating the potential use of a large number of lower-quality segmentations in place of a small number of high-quality segmentations for AI model training. Notably, segmentations were highly variable among the participants of C3RO, suggesting underlying factors associated with resultant segmentation quality.
Despite AI advancements, human clinicians will likely be involved in the radiotherapy segmentation process for the foreseeable future, both as suppliers of ground truth for algorithmic training and as the final arbiters of quality. Understanding the characteristics of clinicians associated with superior segmentation performance could help guide training, inform the design of auto-segmentation tools, and ultimately improve the quality of care provided to patients. While some data do suggest that clinician experience is associated with improved radiotherapy outcomes, [4][5][6] no studies have directly examined underlying factors related to segmentation quality. Therefore, we aim to investigate whether demographic factors of a large number of radiation oncologists are associated with improved segmentation quality through a secondary analysis of C3RO.

METHODS
Study Participants and Demographic Variables
Participants in C3RO were categorized as recognized experts or nonexperts. Recognized experts were identified by the C3RO organizers as board-certified physicians who participated in the development of national guidelines and/or contributed to extensive scholarly activities within a specific disease site. Nonexperts were any participants not categorized as an expert for that disease site. For this study, nonexpert participants from each separate disease site of the C3RO database, namely, the breast, sarcoma, head and neck (H&N), gynecologic (GYN), and GI cases, were selected for the analysis. Further details on the publicly available C3RO data set can be found in the corresponding data descriptor. [7] Self-reported demographic variables of interest from the participants were initially collected through an intake survey performed on REDCap. [8] Informed by previous research, [9,10] various demographic variables were collected for physicians in this study (Table 1). Before use in the analysis, nonexpert participants were filtered out of the data set if they were trainees (eg, residents) or nonphysicians (eg, radiation therapists, medical physicists, other). The primary practice description variable was converted to a binary format by grouping academic/university (academic) into one group and all others into a separate group (nonacademic).

Segmentation Evaluation
All ROIs from all disease sites in the C3RO data set were used for this analysis (Data Supplement, Table S1). Notably, participants generated ROI segmentations based principally on contrast-enhanced radiotherapy planning computed tomography scans. Participants were provided a short clinical history for each case. Additional case-specific considerations included the following: the breast case not receiving contrast, the H&N and GI cases having positron emission tomography scans available for reference, and the sarcoma case having a magnetic resonance imaging scan available for reference. For each nonexpert ROI, we calculated segmentation quality by comparing the nonexpert segmentation with the consensus of experts as derived using the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm [11] (Fig 1). The number of expert observers used for each ROI consensus segmentation is presented in the Data Supplement (Table S1). It should be noted that, as with any segmentation study, there was no definitive underlying ground truth set of segmentations we could reference. Although experts were subjectively determined in the original C3RO study, they demonstrated significantly improved interobserver variability (IOV) compared with their nonexpert counterparts. [3] Therefore, the expert STAPLE can be considered as a reference standard segmentation. We used the existing Neuroimaging Informatics Technology Initiative structure files for comparisons, which were previously converted from Digital Imaging and Communications in Medicine using DICOMRTTool. [12] The Dice similarity coefficient (DSC) was used as the main metric for comparison because of its ubiquity. [1] We also investigated two metrics of surface similarity, the surface DSC (SDSC) and 95% Hausdorff distance (HD95), for additional experiments; SDSC tolerance values for each ROI were determined from the pairwise average surface distance of the expert segmentations. Metrics were calculated using the surface-distance Python package v. 0.1 [13] and in-house Python code (Python v. 3.11.4).
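As an illustration of the main comparison metric, the following is a minimal sketch of the DSC between an observer mask and a reference mask. It is not the study's in-house code: the function name `dice_similarity`, the flat 0/1 list representation of a volume, and the convention of returning 1.0 for two empty masks are assumptions made for this example.

```python
def dice_similarity(observer_mask, reference_mask):
    """Dice similarity coefficient between two binary masks.

    Masks are flat sequences of 0/1 voxel labels of equal length
    (an assumed toy representation; real volumes would be 3D arrays).
    """
    if len(observer_mask) != len(reference_mask):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as structure in both masks.
    overlap = sum(1 for a, b in zip(observer_mask, reference_mask) if a and b)
    # Total structure voxels across the two masks.
    total = sum(1 for a in observer_mask if a) + sum(1 for b in reference_mask if b)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * overlap / total


# Toy 1D "volumes": each mask covers 4 voxels and 3 voxels are shared.
observer = [0, 1, 1, 1, 1, 0, 0, 0]
reference = [0, 0, 1, 1, 1, 1, 0, 0]
dsc = dice_similarity(observer, reference)
print(dsc)  # 2 * 3 / (4 + 4) = 0.75
```

The DSC ranges from 0 (no overlap) to 1 (identical masks), which is why a single raw value can mean different things for structures of different difficulty.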
To ensure that metrics were comparable across ROIs, metrics were stratified into binary groups on the basis of previously established ROI-specific expert-derived IOV cutoffs; cutoffs were calculated as the median of pairwise metric values for all available expert segmentations. [7] Namely, if the metric for a given ROI was greater than or equal to the ROI-specific expert IOV cutoff, it was classified as 1; otherwise, it was classified as 0 (Fig 1). Finally, for each ROI, we calculated the percentage of observers who were able to cross the expert IOV cutoff.
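The stratification step can be sketched as follows, assuming a DSC-like metric where higher is better. The set-of-voxel-indices representation, the helper names, and all toy masks are hypothetical; the reference set stands in for the expert STAPLE consensus, which is not computed here.

```python
from itertools import combinations
from statistics import median


def dice(a, b):
    """DSC for two masks given as sets of voxel indices (toy representation)."""
    return 2.0 * len(a & b) / (len(a) + len(b))


def expert_iov_cutoff(expert_masks, metric):
    """Median of the metric over all unordered pairs of expert segmentations."""
    pairwise = [metric(a, b) for a, b in combinations(expert_masks, 2)]
    return median(pairwise)


def binarize(value, cutoff):
    """1 if the observer metric reaches the expert IOV cutoff, otherwise 0."""
    return 1 if value >= cutoff else 0


# Hypothetical expert segmentations of one ROI.
experts = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 3, 4, 5}]
cutoff = expert_iov_cutoff(experts, dice)  # median of [0.75, 8/9, 8/9]

# Hypothetical reference standard and nonexpert observers.
reference = {1, 2, 3, 4, 5}
observers = {"A": {1, 2, 3, 4, 5, 6}, "B": {3, 4, 5, 6, 7}, "C": {2, 3}}

labels = {name: binarize(dice(mask, reference), cutoff)
          for name, mask in observers.items()}
percent_crossing = 100.0 * sum(labels.values()) / len(labels)
print(labels, percent_crossing)
```

Because the cutoff is computed per ROI, the same binary label carries a comparable meaning for an easy OAR and a difficult tumor volume.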

Bayesian Regression Analysis
Generalized linear mixed-effects models with Bayesian estimation were used to investigate the relationship between demographic variables and binarized segmentation quality metrics. The stratified binary segmentation quality metric acted as the dependent variable for the models. The key independent variables were practice location, primary practice type, number of radiation oncologist colleagues, presence of another radiation oncologist during clinic, whether the participant actively treats the disease site, and years of practice. Notably, exploratory correlative analysis (Data Supplement, Figs S1-S5) revealed a relatively high correlation between academic affiliation and primary practice type; therefore, academic affiliation was not included as a covariate to facilitate model parsimony. An additional binary categorical variable, ROI type, was added as an independent variable to indicate whether the ROI was an OAR or tumor volume. Furthermore, models were corrected for self-identified sex and self-identified race by including them as independent variables in the models. A random intercept was used in the models to account for observers who could segment multiple structures on the same image. Any empty values for numerical variables were imputed to the median value relative to the total number of observations. Finally, numerical variables were Z-score normalized.
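The preprocessing of numerical variables described above can be sketched as below. Two details are assumptions of this example rather than statements from the study: the median is taken over the observed (non-missing) values, and the Z-score uses the population standard deviation. The variable values are invented.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def z_score(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot normalize a constant variable")
    return [(v - mu) / sigma for v in values]


# Hypothetical years-of-practice column with one missing entry.
years_of_practice = [2, 10, None, 6, 22]
imputed = impute_median(years_of_practice)  # missing value becomes 8
normalized = z_score(imputed)
print(imputed, normalized)
```

Normalizing numerical predictors places their coefficients on a common scale, so each coefficient describes the change in log odds per standard deviation of the predictor.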
The Python package Bambi v. 0.12.0, [14] which is built on top of the robust Markov chain Monte Carlo (MCMC) library PyMC3, [15] was used for all regression analyses. For each disease site, the regression formula was defined as

$$Y_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad \mathrm{logit}(p_{ij}) = \beta_0 + \beta_1 X_{1ij} + \cdots + \beta_9 X_{9ij} + u_j$$

where $Y_{ij}$ is the dependent variable for observation $i$ nested within observer $j$, which follows a Bernoulli distribution with success probability $p_{ij}$; $\mathrm{logit}(p_{ij})$ is the log odds of the success probability; $\beta_0$ is the overall intercept; $u_j$ is the random intercept for observer $j$; and $\beta_1, \ldots, \beta_9$ are the fixed effect coefficients for the predictors $X_{1ij}, \ldots, X_{9ij}$.
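To make the log-odds scale of the model concrete, the sketch below converts a linear predictor into a success probability with the inverse logit. All coefficient values are invented for illustration and are not results from this study; the function names are hypothetical.

```python
import math


def inverse_logit(eta):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def success_probability(intercept, observer_effect, coefficients, predictors):
    """Probability of crossing the cutoff for one observation.

    The linear predictor is the overall intercept plus the observer's random
    intercept plus the sum of fixed effect coefficients times predictor values.
    """
    if len(coefficients) != len(predictors):
        raise ValueError("one coefficient is needed per predictor")
    eta = intercept + observer_effect + sum(
        b * x for b, x in zip(coefficients, predictors)
    )
    return inverse_logit(eta)


# Invented example with a single predictor, ROI type (0 = OAR, 1 = tumor),
# and a coefficient of -1.0 on the log-odds scale.
p_oar = success_probability(0.0, 0.0, [-1.0], [0])
p_tumor = success_probability(0.0, 0.0, [-1.0], [1])
print(p_oar, p_tumor)  # 0.5 and 1 / (1 + e), about 0.269
```

A negative coefficient therefore lowers the probability that a segmentation crosses the expert IOV cutoff, and the size of the change in probability depends on where the linear predictor sits on the logistic curve.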
For each MCMC Bayesian regression model, 10,000 samples were drawn from four chains with a tuning set of 1,500 iterations for a total of 46,000 samples drawn for each model. Weakly informative priors were generated automatically by the Bambi package for all model terms by loosely scaling them to the observed data. [14] Computations were performed across six cores of an Intel Core i7-8700 Processor.
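A model of this kind could be specified in Bambi roughly as follows. This is a sketch under stated assumptions, not the study's code (which is on GitHub): every column name is hypothetical, and the data frame is assumed to hold one row per ROI observation with a 0/1 outcome. Bambi is imported inside the fitting function so that the formula helper can be inspected without the library installed.

```python
def build_formula(outcome, predictors, group):
    """Fixed effects for each predictor plus a random intercept per group."""
    return f"{outcome} ~ {' + '.join(predictors)} + (1|{group})"


# Hypothetical column names for the nine predictors described in the text.
PREDICTORS = [
    "practice_location", "practice_type", "n_colleagues",
    "colleague_present", "treats_site", "years_of_practice",
    "roi_type", "sex", "race",
]
FORMULA = build_formula("dsc_above_cutoff", PREDICTORS, "observer_id")

# Sampling settings reported in the text.
DRAWS, TUNE, CHAINS, CORES = 10_000, 1_500, 4, 6
TOTAL_SAMPLES = (DRAWS + TUNE) * CHAINS  # 46,000 per model


def fit_site_model(data):
    """Fit the Bernoulli mixed-effects model for one disease site.

    `data` is assumed to be a pandas DataFrame containing the columns above.
    No priors are passed, so Bambi falls back to its default weakly
    informative priors.
    """
    import bambi as bmb  # lazy import; requires Bambi and PyMC

    model = bmb.Model(FORMULA, data, family="bernoulli")
    return model.fit(draws=DRAWS, tune=TUNE, chains=CHAINS, cores=CORES)


print(FORMULA)
```

Calling `fit_site_model` once per disease site would return one posterior object per case, mirroring the per-site models described above.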
The ArviZ v. 0.15.1 [16] Python library was used to derive summary data for the posterior distribution. Point estimates (posterior means) and assessments of uncertainty (posterior standard deviation) were calculated for each variable. In addition, the 89% highest density interval (HDI) was calculated; a value of 89% was selected as suggested in recent literature. [17,18] Demographic variables for which the HDI did not include zero were considered to have a substantial impact on the outcome measure of interest and could be interpreted as loosely analogous to the frequentist notion of statistical significance.
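The decision rule based on the HDI can be illustrated with a small pure-Python sketch. It uses the common sorted-window approach for unimodal posteriors (the narrowest interval containing the requested share of draws); it is an illustration of the idea rather than the ArviZ implementation, and the draws are synthetic.

```python
import math


def hdi(samples, prob=0.89):
    """Narrowest interval containing about `prob` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = int(math.floor(prob * n))
    if window < 1 or window >= n:
        raise ValueError("not enough samples for the requested interval")
    # Width of every candidate interval spanning `window` sorted steps.
    widths = [ordered[i + window] - ordered[i] for i in range(n - window)]
    start = min(range(len(widths)), key=widths.__getitem__)
    return ordered[start], ordered[start + window]


def excludes_zero(low, high):
    """True when the whole interval lies on one side of zero."""
    return low > 0 or high < 0


# Synthetic, skewed "posterior draws" for a single coefficient.
draws = [-1.5 + (i / 100) ** 2 for i in range(100)]
low, high = hdi(draws, prob=0.89)
substantial = excludes_zero(low, high)
print(low, high, substantial)
```

With real output from a fitted model, the same summary is available directly from ArviZ, for example through `az.summary(idata, hdi_prob=0.89)`.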

Data and Code Availability
All C3RO data, including the original demographic factors and segmentation data, are available on Figshare (DOI: doi.org/10.6084/m9.figshare.21074182). All Python code used for this study is available on GitHub. [19] Corresponding newly created data can also be found on Figshare (DOI: doi.org/10.6084/m9.figshare.24021591).

RESULTS
Study Participants
After filtering out structures from noneligible observers, 574, 110, 452, 112, and 48 ROI structure observations from practicing radiation oncologist observers remained for the analysis for the breast, sarcoma, H&N, GYN, and GI cases, respectively (Fig 2). Descriptive statistics for the clinician observers used in our study are presented in the Data Supplement (Tables S2 and S3).

DISCUSSION
To date, there are limited standardized measures to evaluate radiotherapy-related segmentation quality. Nissen et al [20] recently proposed the utilization of the Jaccard Index for longitudinal quantitative evaluation. However, the inherent quality discerned from these metrics in their raw numerical form often varies on the basis of the specific ROI. For example, a DSC of 0.80 for a particularly simple OAR may be less desirable than a DSC of 0.80 for a particularly difficult tumor volume. However, stratification of evaluation metrics, as we have performed in our study, allows for ROI-specific thresholds that act as rough measures of acceptability. Notably, our ROI-specific thresholds are derived from reference standard measurements provided by experts, which were established to have significantly improved segmentation consistency compared with nonexperts. [3] When stratified by previously defined expert IOV cutoffs, the ROIs with the lowest percentage of observers who were able to cross cutoffs were often tumor volumes. This is consistent with the generally held notion that tumor volumes, which often require domain-specific knowledge, are inherently more difficult to segment than OARs. standardized automated tumor segmentation methods, which so far have been less developed and used than their OAR counterparts. [9]
Interestingly, results were inconsistent and mostly nonsubstantial for the majority of demographic variables across disease sites. Historically, greater institutional support has been perceived to be important for radiotherapy quality. [9] Therefore, our mostly negative results for proxy variables intuitively linked to greater institutional resource support, such as academic practice and prevalence of radiation oncologist colleagues, are particularly surprising. These findings suggest that the auto-segmentation community should reconsider heuristic choices, such as those based on annotator qualities, when choosing reference segmentations for algorithm development. It may instead be preferable to use consensus segmentations, which have demonstrated quantitatively reliable results as shown in our previous work, [3] in place of single-annotator segmentations for prospective data collection efforts. Finally, while existing literature regarding observer demographic impact on radiotherapy-related tasks is sparse, it warrants mentioning that one of the few studies in this area found no significant relationship between demographic factors and the resultant quality of radiotherapy plans. [23] Another study investigating lung disease annotations also demonstrated no impact of observer demographics on segmentation quality. [24] These studies echo our mostly null results.
While most of the investigated demographic variables were nonsubstantial with large degrees of uncertainty, there were a few results that we believe warrant further discussion. Academic practice in the GYN case was substantially negatively associated with SDSC performance; a nonsubstantial negative association was echoed in most of the disease sites. This could imply, perhaps contrary to common assumptions, that community clinicians produce segmentations more closely aligned with our reference standard and, presumably, more consistent with contouring guidelines. Moreover, White racial self-identification was substantially positively associated with DSC in the H&N case, but this variable exhibited conflicting relationships in other disease sites. It is crucial to emphasize that the association between racial self-identification (a complex social construct that has been drastically simplified in this binary variable) and segmentation performance likely reflects broader institutional or regional conformance to contouring guidelines, rather than a reductive racial skill disparity. Notably, US and European organizations, which would have over-representation of White racial self-identification, have the largest proportion of contouring guideline endorsements. [25] The heterogeneity within C3RO's categorization of non-US observers might have confounded these relationships. In addition, the presence of a radiation oncologist colleague was substantially positively associated with DSC in the GI case; this positive relationship seemed to hold for most disease sites. These results suggest that clinicians who likely participate in consensus decision making tend to create segmentations closer to our reference standard and thus are likely to adhere to guidelines. Perplexingly, years of practice was found to have a consistently negative (though nonsubstantial) impact on DSC across the various disease sites. This may be because recent clinician graduates are more likely to adhere to contouring guidelines. Finally, our study did not show that treatment of a particular disease site was substantially associated with superior segmentation quality; in fact, it often demonstrated a negative correlation. [4][5][6] However, the variable did not assess treatment frequency for the specific site, thereby potentially introducing heterogeneity in its interpretation and ultimately diminishing its utility.
Our study is not without limitations. First, we relied on an existing data set with inherent constraints. While boasting an unprecedented number of radiation oncologist observers, C3RO principally used only a single imaging modality from one representative patient per disease site. While this provides a dedicated reference standard, demographic relationships could change depending on a variety of underlying patient-related factors. Moreover, the C3RO intake survey was self-reported and requested limited demographic information. For example, direct indicators of treatment volume, which have been shown in previous studies to be strongly correlated with patient outcomes, [4] were not collected because of the high potential for recall bias. Similarly, variables related to the annotator's initial clinical training and current workflow, that is, routine use of contour guidelines/resources/software and access to multiple imaging modalities, would have also likely been highly informative but were not collected. Second, we have relied exclusively on conventional geometry-based metrics of segmentation quality, which have been noted to have flaws in the assessment of radiotherapy-related structures. [1] Future studies should investigate metrics more closely tied to relevant patient outcomes, such as dose-volume histogram measures. On a related note, how to best define segmentation quality in a quantitative manner, and subsequently how to improve it, remains an open question. We hope to mitigate some of these issues by binarizing our outcome segmentation quality variable and thus calibrating the value relative to a reference standard baseline. We fully acknowledge that this methodology has flaws, principally in that edge cases may be unfairly penalized or rewarded. Furthermore, our definition of a reference standard baseline is a subjective metric derived from our own data set. Specifically, our study presupposes expert consensus segmentations as ideal quality benchmarks. Large deviations from this assumption could indicate that our results simply reflect expected segmentation similarity variations secondary to clinical practice variation. A final limitation of our study lies in our reliance on weakly informative priors for our Bayesian analysis, primarily because of insufficient existing data to extract meaningful priors. Nevertheless, our current data can serve as valuable priors for future Bayesian analyses.
In conclusion, we used an extensive number of radiation oncologist observers in several disease sites to probe trends between common demographic variables and segmentation quality using generalized linear mixed-effects models with Bayesian estimation. Tumor-related structures were, as expected, more difficult to segment than OARs. However, results for demographic factors were mixed and exhibited high uncertainty, as evidenced by large posterior standard deviations and wide HDIs. Surprisingly, there were no obvious recurring relationships for conventionally presumed factors influencing segmentation quality; this may incentivize the research community to reconsider heuristic choices when selecting reference segmentations for auto-segmentation development.
While stark variations in quantitative performance among observers compared with our reference standard segmentations can be observed, it is still unclear if and how demographic factors influence segmentation similarity to these benchmarks. Given the anticipated scenario that auto-segmentation algorithms will require humans in the loop in some capacity, these factors are still likely important to understand and should be investigated in prospective analyses of auto-segmentation interaction. Future studies should investigate a greater number of demographic variables (eg, direct indicators of treatment volume), a greater number of patients and imaging modalities, and alternative metrics of segmentation acceptability (eg, dosimetric indicators).
FIG 1. Derivation of binarized structure segmentation quality for each observer. Each observer could segment multiple structures, that is, organs at risk and tumor volumes. Observer segmentations (red volume) were compared with a reference standard derived from a consensus segmentation of experts (green volume) using the DSC. Segmentation metrics were then stratified into being greater than or equal to (yes-1) or below (no-0) a previously derived expert-derived IOV cutoff value for that particular region of interest. In this example, the primary gross tumor volume structure for the head and neck case is shown. A similar process was used to derive binarized values for surface DSC. DSC, Dice similarity coefficient; IOV, interobserver variability.

FIG 2. Flow diagrams showing the final number of structure segmentations evaluated for each disease site. Breast, sarcoma, H&N, GYN, and GI cases are shown in panels (A-E), respectively. GYN, gynecologic; H&N, head and neck; N, number of nonexpert structure segmentations; O, number of unique nonexpert observers.

TABLE 1. Demographic Variables Examined in This Study
Variable | Description
Practice location | Geographic location where participant actively practices. Binary variable with possible values of US or non-US.
Sex | Self-identified sex. Binary variable with possible values of male or female. Original variable included nonbinary as an option, but it was not selected by any participants.
Race | Self-identified race. Binary variable with possible values of White or non-White.
Academic affiliation | Whether the participant actively holds an academic affiliation. Binary variable with possible values of yes or no.

TABLE 2. Generalized Linear Mixed-Effects Models With Bayesian Estimation Results Using Binarized Dice Similarity Coefficient as the Outcome Variable
NOTE. Model coefficient values are shown for each variable. Reference variables for categorical variables are shown in brackets next to the variable name. Sign value in posterior mean indicates positive or negative correlation of variable with outcome. Posterior SD indicates uncertainty around posterior mean. Eighty-nine percent HDI is shown in parentheses after posterior mean. Variables in bold have an HDI that does not contain zero and are considered to have a substantial impact on the outcome measure of interest.
Abbreviations: HDI, highest density interval; ROI, region of interest; SD, standard deviation.

TABLE 3. Generalized Linear Mixed-Effects Models With Bayesian Estimation Results Using Binarized Surface Dice Similarity Coefficient as the Outcome Variable
NOTE. Model coefficient values are shown for each variable. Reference variables for categorical variables are shown in brackets next to the variable name. Sign value in posterior mean indicates positive or negative correlation of variable with outcome. Posterior SD indicates uncertainty around posterior mean. Eighty-nine percent HDI is shown in parentheses after posterior mean. Variables in bold have an HDI that does not contain zero and are considered to have a substantial impact on the outcome measure of interest.
Abbreviations: HDI, highest density interval; ROI, region of interest; SD, standard deviation.